Comparing Corpora And Lexical Ambiguity

Authors

  • Patrick Ruch
  • Arnaud Gaudinat
Abstract

In this paper we compare two types of corpus, focusing on the lexical ambiguity of each of them. The first corpus consists mainly of newspaper articles and literature excerpts, while the second belongs to the medical domain. To conduct the study, we have used two different disambiguation tools. First of all, however, we must verify the performance of each system in its respective application domain. We then use these systems to assess and compare both the general ambiguity rate and the particularities of each domain. Quantitative results show that medical documents are lexically less ambiguous than unrestricted documents. Our conclusions show the importance of the application area in the design of NLP tools.

Introduction and background

Although some large-scale evaluations carried out on unrestricted texts (Hersh 1998a, Spark Jones 1999), and even on medical documents (Hersh 1998b), reach quite critical conclusions about using NLP tools for information retrieval, we believe that such tools are likely to solve some lexical ambiguity issues. We also believe that some special settings, particular to the application area, must be taken into account while developing such NLP tools. Let us recall two major problems in retrieving documents with NLP engines (Salton, 1988):

1-Expansion: the user is generally as interested in retrieving documents with semantically related words (synonyms, generics, specifics...) as in retrieving documents with exactly the same words. Thus, a query based on the word liver should be able to retrieve documents containing words such as hepatic. This expansion process is usually thesaurus-based. The thesaurus can be built manually or automatically (as, for example, in Nazarenko, 1997).

2-Disambiguation: a search based on tokens may retrieve irrelevant documents since tokens are often lexically ambiguous. Thus, face can refer to a body part, as a noun, or to an action, as a verb.
Finally, this latter problem may be split into two subproblems. The disambiguation task can be based on part-of-speech (POS) or word-sense (WS) information, but the chronological relation between the two is still under discussion within the community. Although the target of our work (Ruch et al., 1999, Bouillon et al., 2000) is a fine-grained semantic disambiguation of medical texts for IR purposes, we believe that POS disambiguation is an important preliminary step. Therefore this paper focuses on POS tagging, and compares morpho-syntactic lexical ambiguities (MSLA) in medical texts to MSLA in unrestricted corpora. Although the results of the study conform to preliminary naive expectations, the method is quite original [1]. Most of the comparative studies dedicated to corpora have addressed the problem by applying metrics on word entities or word pieces (as in studies working with n...

[1] We do not claim to be pioneers in the domain, as other authors (Biber 1998, Folch et al., 2000) are exploring similar metrics. However, it is interesting to notice that for these authors the adaptation of the NLP tools has rarely been questioned from a technical point of view, and in order to feed back into the design of NLP systems.
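To make the notion of a morpho-syntactic ambiguity rate concrete, the following is a minimal sketch and not the measurement procedure used in the paper. It assumes that a token counts as ambiguous when a lexicon lists more than one candidate POS tag for it, and that tokens missing from the lexicon are ignored. The function name `ambiguity_rate`, the toy `LEXICON`, and the tag labels are all hypothetical, chosen for illustration only.

```python
from typing import Dict, Iterable, Set


def ambiguity_rate(tokens: Iterable[str], lexicon: Dict[str, Set[str]]) -> float:
    """Return the share of known tokens that have more than one candidate POS tag.

    Assumption: tokens absent from the lexicon are skipped rather than
    counted as ambiguous or unambiguous.
    """
    known = [t.lower() for t in tokens if t.lower() in lexicon]
    if not known:
        return 0.0
    ambiguous = sum(1 for t in known if len(lexicon[t]) > 1)
    return ambiguous / len(known)


# Toy lexicon, invented for this sketch: word form -> candidate POS tags.
LEXICON: Dict[str, Set[str]] = {
    "the": {"DET"},
    "we": {"PRON"},
    "face": {"NOUN", "VERB"},  # body part (noun) or action (verb)
    "problems": {"NOUN"},
    "liver": {"NOUN"},
    "is": {"VERB"},
    "enlarged": {"ADJ", "VERB"},
}

sample_a = "we face the problems".split()
sample_b = "the liver is enlarged".split()

# One token out of four is ambiguous in each toy sample.
print(ambiguity_rate(sample_a, LEXICON))  # 0.25
print(ambiguity_rate(sample_b, LEXICON))  # 0.25
```

The two toy samples say nothing about real corpora. A comparison in the spirit of the paper would apply the same measure to a large general corpus and a large medical corpus, each with a lexicon suited to its domain.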

Similar resources

Gesture and its impact of resolving lexical ambiguity

The study aimed to shed light on the use of gesture in resolving lexical ambiguity employed by TEFL students. To this end, 60 intermediate Iranian learners, studying at Kish Way Language School in Iran were recruited. The participants were randomly put into two experimental groups and one control group. Both of the experimental groups received the same teaching approach, i.e. teaching homonyms ...

Comparing Lexical Bundles in Hard Science Lectures; A Case of Native and Non-Native University Lecturers

Researchers stated that learning and applying certain set of lexical bundles of native lecturers by non-native lecturers would help students improve their proficiency through incidental vocabulary input. The present study shed light on the lexical bundles in hard science lectures used by Native and Non-native lecturers in international universities with the main purpose of analyzing the structu...

Reducing Lexical Ambiguity in Serbo-Croatian

This paper presents an approach to acquisition of some lexical and grammatical constraints from large corpora using genetic algorithms. The main aim is to use these constraints to automatically define local grammars that can be used to reduce lexical ambiguity usually found in an initially tagged text. A genetic algorithm for computation of the minimal representation of grammatical features of ...

Entropy and Redundancy of Japanese Lexical and Syntactic Compound Verbs

The present study investigated Japanese lexical and syntactic compound verbs (V1+V2) using Shannon’s concept of entropy and redundancy calculated using corpora from the Mainichi Newspaper and a collection of selected novels. Comparing combinations of a V2 verb with various V1 verbs, syntactic compounds were higher in entropy than lexical ones while neither differed in redundancy. This result su...

The Role of Non-Ambiguous Words in Natural Language Disambiguation

This paper describes an unsupervised approach for natural language disambiguation, applicable to ambiguity problems where classes of equivalence can be defined over the set of words in a lexicon. Lexical knowledge is induced from non-ambiguous words via classes of equivalence, and enables the automatic generation of annotated corpora. The only requirements are a lexicon and a raw textual corpus...

A Rule-Based and MT-Oriented Approach to Prepositional Phrase Attachment

Prepositional Phrase is the key issue in structural ambiguity. Recently, researches in corpora provide the lexical cue of prepositions with other words and the information could be used to partly resolve ambiguity resulted from prepositional phrases. Two possible attachments are considered in the literature: either noun attachment or verb attachment. In this paper, we consider the problem from ...

Publication date: 2000